Lab 07 Regression

RE 519 Data Analytics and Visualization | Autumn 2025


In Lab 7, we will go back to the King County sales data from maintained by Andy Krause at Zillow. You can find the datasets and Readme file in this repository.

In Part A, we will use classical regression, which considers more on the model specification, assumptions, and parameter interpretation. In Part B, we will still use regression but under the context of machine learning. We care about the prediction accuracy on unseen test dataset. A seminal paper discusses the two cultures in statistical modeling: Statistical Modeling: The Two Cultures. Leo Breiman. 2001.

The due day for each lab can be found on the course wesbite. The submission should include Rmd, html, and any other files required to rerun the codes.

From Lab 4, the use of any generative AI tool (ChatGPT, Copilot, etc.) is allowed for coding assignments. But, I still encourage you to write codes by yourself and use AI tools as a way to debug and explore new knowledge. More information about Academic Integrity and the Use of AI.


Lab 07-A: Regression in Statistics

#install.packages("modelsummary") # you only need to run the installation once
#install.packages("devtools") # if you are using a Windows computer
#devtools::install_github('andykrause/kingCoData') # you only need to run the installation once
library('tidyverse')
library(modelsummary) # the package for formatting the regression result table
library('kingCoData') # load the data package
data(kingco_sales) # load the sale data
# only select the sales in 2023
sales <- kingco_sales %>% 
  filter(sale_date >= as.Date("2023-01-01") & sale_date <= as.Date("2023-12-31"))

A Simple Linear Regression

Checking Correlation

Correlation (\(r\)) measures the strength and direction of a linear relationship between two variables. The most common one for continuous numeric data is Pearson correlation, which ranges from -1 to 1.

Note: Correlation is a useful starting point for selecting variables. We often use \(|r| > 0.3\) as a rough criterion for inclusion. But we still need to incorporate domain knowledge and some diagnostic testing. For example, correlation between sale price and year built is only ~0.1, which suggests a weak linear relationship, but it may still be theoretically relevant.

cor(sales[, c("sale_price", "sqft", "year_built", "condition")])
##             sale_price        sqft   year_built   condition
## sale_price 1.000000000  0.53952741  0.009373296  0.01784696
## sqft       0.539527410  1.00000000  0.297844019 -0.05792704
## year_built 0.009373296  0.29784402  1.000000000 -0.35992275
## condition  0.017846960 -0.05792704 -0.359922752  1.00000000

Scatterplots

sales_long <- sales |>
  pivot_longer(
    cols = c(sqft, year_built),
    names_to = "variable",
    values_to = "xvalue"
  )

ggplot(sales_long, aes(x = xvalue, y = sale_price)) +
  geom_point(size = 0.3, alpha = 0.5, color = "#e8e3d3") +
  geom_smooth(linewidth = 0.5, method = "lm", formula = "y~x", color = "#4b2e83") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(
    x = "Explanatory Variable",
    y = "Sale Price ($)",
    title = "Sale Price vs. Square Footage of House (Sqft) and Year Built"
  ) +
  theme_minimal()

Fitting the Model

Starting to fit a regression, you define a family of models that express a precise, but generic, pattern that you want to capture. For example, the pattern might be a straight line, or a quadratic curve. You will express the model family as an equation like \(y = \beta_0 + \beta_1 x\) or \(y = \beta_0 + \beta_1 x^2\). Here, \(x\) and \(y\) are known variables from your data, and \(\beta_0\) and \(\beta_1\) are parameters that can vary to capture different patterns.

# lm means linear model
# we don't need to add intercept, R will add it for us
simple_lm <- lm(data = sales, sale_price ~ sqft + year_built)

We can check the results using summary() (result table) and confint() (confidence intervals).

summary(simple_lm)
## 
## Call:
## lm(formula = sale_price ~ sqft + year_built, data = sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1877471  -322074   -92840   173424 13832047 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.982e+06  3.268e+05   24.42   <2e-16 ***
## sqft         5.082e+02  5.975e+00   85.06   <2e-16 ***
## year_built  -4.009e+03  1.672e+02  -23.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 653700 on 15651 degrees of freedom
## Multiple R-squared:  0.3162, Adjusted R-squared:  0.3161 
## F-statistic:  3619 on 2 and 15651 DF,  p-value: < 2.2e-16
confint(simple_lm)
##                    2.5 %       97.5 %
## (Intercept) 7341424.5882 8622695.8739
## sqft            496.5235     519.9459
## year_built    -4336.6278   -3681.2999

How to read the tables?

Results from R and Interpretations
Click to view additional notes on reading the results

Hypothesis: The model test if \(Y\) is associated with \(X\). The null hypothesis \(H_0\) is that there is no linear relationship between the response variable and the explanatory variable(s), and the alternative hypothesis \(H_1\) is that there is relationship.

Estimate and standard error:In the model, estimate is the estimation of coefficient \(\beta\), std.error is the standard error for estimation. The standard error of the coefficient is an estimate of the standard deviation of the coefficient \(\beta\). In effect, it is telling us how much uncertainty there is with our coefficient. The standard error is often used to create confidence intervals.

t-statistic: The t-statistic is simply the coefficient divided by the standard error. In general, we want our coefficients to have large t-statistics, because it indicates that our standard error is small in comparison to our coefficient. Simply put, we are saying that the coefficient is \(t\) standard errors away from zero.

p-value: If you see the variable has p value \(\leq 0.05\), or you can also see the star next to \(Pr(>|t|))\), we can say we reject the null hypothesis. In practice, any p-value below 0.05 is usually deemed as significant. What do we mean when we say significant? It means we are confident that the coefficient is not zero, meaning the coefficient does in fact add value to the model by helping to explain the variance within our dependent variable. Note: reading the stars this way only works for individual variables and not categorical variables.

R-squared: R-squared explains how much variation of y can be explained by the model. This is helpful for univariate regression, but starts to become fairly unhelpful for multivariate regression. We would look at adjusted R-squared, which penalizes complexity.

Check Assumptions

For a normal linear model to be a good model, there are some conditions/assumptions that need to be fulfilled.

If these assumptions are not met, the inference with the computed standard error is invalid. That is, if the assumptions are not met, the standard error should not be trusted, or should be computed using alternative methods.

The quickest way to check many assumptions is to use plot(model) and it will generated several key plots for us. We use which to specify the plot but in practice, you can generate them at the same time.

We will setup a ‘perfect’ model as the example to compare:

example_x <- seq(1, 100, by = 1)
example_y <- 3 + 2 * example_x + rnorm(100, mean = 0, sd = 2)
example_model <- lm(example_y ~ example_x)

Residuals vs Fitted

plot(example_model, which = 1, title("Perfect Model"))

plot(simple_lm, which = 1, title("Our Linear Model"))

Horizontal Axis: Fitted values (predicted values) from the regression model.

Vertical Axis: Residuals (the differences between the observed values and the predicted values).

Interpretation: This plot helps you check for linearity and homoscedasticity (equal variance of error terms). Ideally, you should see a horizontal red line (for linearity) and random scatter of points with no discernible pattern (for homoscedasticity). If there’s a pattern (e.g., a curve or fan shape), it suggests non-linearity or heteroscedasticity, which might indicate a problem with the model.

In our case: no much concern on the linearity (red line is around 0); some level of heteroscedasticity (fan shape - residuals variance increase with the increase of fitted values). That happens a lots in price data, which is often right-skewed/long-tailed and log transformation to \(Y\) is a common strategy.

Q–Q Plot

Q-Q plots helps us to compare a set of data with normal distribution. If the data is perfectly normally distributed, it should follow the dash line. In our case, we can see the residuals are not normally distributed. It reconfirms the price data is right-skewed/long-tailed.

plot(example_model, which = 2, title("Perfect Model"))

plot(simple_lm, which = 2, title("Our Linear Model"))

Residuals vs Leverage

plot(example_model, which = 5, title("Perfect Model"))

plot(simple_lm, which = 5, title("Our Linear Model"))

Horizontal Axis: Leverage values, which indicate how much the corresponding observation influences the fit.

Vertical Axis: Standardized residuals (residuals divided by their standard deviation).

Interpretation: This plot helps you identify influential data points (outliers) and influential cases (observations with high leverage). Points that are far from the horizontal line at zero indicate observations with high standardized residuals. Points that are far from the vertical line at 1/n (where n is the number of observations) have high leverage. The gray dash line is the Cook’s Distance; points above the line have will have noticeable change to regression line if removed.

In our case: There are a few outliers (5452, 13765, 11635) but it is not a super serious problem (within Cook’s distance = 0.5). We can check whether the original data is correctly collected. After that, we can decide whether to remove them.

Multicollinearity

The Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated due to collinearity with other predictors. If there is perfect multicollinearity, R will automatically drop one of the variables. If there is imperfect multicollinearity. R won’t drop and the model estimates will be biased.

#install.packages('car')
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(simple_lm)
##       sqft year_built 
##   1.097347   1.097347

Adjustment to the Regression

Because the original model showed heteroscedasticity (the “fan shape” pattern in residual plots), we apply a log transformation to the \(Y\) sale_price. Almost everything is better after the transformation.

log_simple_lm <- lm(data = sales, log(sale_price) ~ sqft + year_built)
plot(log_simple_lm)

summary(log_simple_lm)
## 
## Call:
## lm(formula = log(sale_price) ~ sqft + year_built, data = sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18762 -0.26969 -0.02224  0.22164  2.58006 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.774e+01  1.997e-01   88.80   <2e-16 ***
## sqft         3.717e-04  3.651e-06  101.80   <2e-16 ***
## year_built  -2.395e-03  1.022e-04  -23.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3995 on 15651 degrees of freedom
## Multiple R-squared:  0.3995, Adjusted R-squared:  0.3995 
## F-statistic:  5207 on 2 and 15651 DF,  p-value: < 2.2e-16

You may notice that the \(\beta\) in log-transformed model is quite small. Because of the transformation, our interpretation of \(\beta\) could be different! Holding other variables constant, we change \(X_1\) for one unit, we can write as:

\[ \begin{aligned} \log(Y_{changed}) & = \beta_0 + \beta_1 (X_1 + 1) + \beta_2 X_2 + \varepsilon \\ & = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon + \beta_1 \\ & = \log(Y_{original}) + \beta_1\\ \beta_1 & = \log(Y_{changed}) - \log(Y_{original}) \\ & = \log(\frac{Y_{changed}}{Y_{original}}) \\ \exp(\beta_1) & = \frac{Y_{changed}}{Y_{original}} \end{aligned} \]

We use coef() to get accurate \(\beta\):

coef(log_simple_lm)
##   (Intercept)          sqft    year_built 
## 17.7377723721  0.0003717023 -0.0023949085
exp(0.0003717023)
## [1] 1.000372

So, we can explain the relationship between sale price and square footage of house as: holding all year variables constant, the expected percentage change in price for 1 sqft increase is 0.3%.

Note: if we want to know expected percentage change in price for 100 sqft, it should be \(1.000372^{100}\) (+3.79%) instead of \(0.000372 \times 100\) (+3.72%).

Interactions and Non-linear Variables

An interaction means that the effect of one variable depends on the level of another variable. For example, we may think of the effect of house size on price changes depending on the house’s condition. We can add interaction terms by using *. Note: R will add condition itself as an explanatory variable in the following model:

log_no_interaction_lm <- lm(data = sales, log(sale_price) ~ year_built + sqft + condition)
log_interaction_lm <- lm(data = sales, log(sale_price) ~ year_built + sqft + sqft * condition)

You could also have variables in you specification that are squared, cubed etc. such as:

model7 <- lm(sale_price ~  sqft + I(sqft^2) + cos(sqft), data = sales)

The I() is a base R command that forces R to evaluate anything inside the I() before anything else. Sometimes it is needed when conducting regressions.

Compare Models

When compare with several models, we can use modelsummary package to display and compare regression results in one formatted table. It helps us quickly see how coefficients, significance levels, and model fit change.

modelsummary(
  list(
    "Without Interaction Model" = log_no_interaction_lm,
    "Interaction Model" = log_interaction_lm
  ),
  stars = TRUE,
  fmt = 5, 
  title = 'Comparsion Table between Models with/without Interaction',
  output = "html")
Comparsion Table between Models with/without Interaction
Without Interaction Model Interaction Model
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 17.71652*** 17.81729***
(0.22059) (0.22081)
year_built −0.00239*** −0.00232***
(0.00011) (0.00011)
sqft 0.00037*** 0.00026***
(0.00000) (0.00002)
condition 0.00104 −0.06512***
(0.00458) (0.01093)
sqft × condition 0.00003***
(0.00000)
Num.Obs. 15654 15654
R2 0.400 0.401
R2 Adj. 0.399 0.401
AIC 447439.8 447397.5
BIC 447478.1 447443.4
Log.Lik. −7846.290 −7824.117
F 3471.285 2621.781
RMSE 0.40 0.40

In an interaction model, the \(\beta\) of individual variables (those also appearing in the interaction term) represent their effects only when the other interacting variable equals zero. So, it is a baseline effect and does not have much direct meanings. We must always interpret them together with the interaction term. In our interaction model:

For the interaction term, we can explain as: with a one-unit increase in condition, holding other variables constant, the coefficient (slope) of sqft increases by 0.0003.

We see a significant positive interaction: the positive effect of square footage on price is stronger for houses in better condition. In other words, larger houses benefit more from good condition than smaller ones.

Performance Metrics

R-squared and Adjusted R-squared [larger - better]: Mentioned in the lecture, R-squared shows how much of the variation in the response variable is explained by the model. Adjusted R-squared corrects for the number of predictors, so it is better for comparing models with different numbers of variables.

Root Mean Squared Error (RMSE) [smaller - better]: Root of MSE, measures the average distance between the observed and predicted values. Smaller RMSE means better model fit.

Akaike’s Information Criterion (AIC) [smaller - better]: AIC is good for unknown data generating process(The AIC tries to select the model that most adequately describes an unknown, high dimensional reality. Akaike’s Information Criteria is good for making asymptotically equivalent to cross-validation. You can check AIC using AIC(model).

Bayesian Information Criterion (BIC) [smaller - better]: BIC tries to find the TRUE model among the set of candidates. A lot of times, BIC and AIC will agrees on each other, but BIC penalize models that have a lot of parameters. Danger of underfit. Bayesian Information Criteria is good for consistent estimation. You can check BIC using BIC(model).

In general, our interaction model perform better in terms of all four metrics. And we will use this model to conduct prediction. Note: we may still need same assumption check for the new model.

Prediction

We can use predict function to generate predicted values from a regression model.

# we add a new column in sales dataframe
sales$predicted <- predict(log_interaction_lm, data = sales)
# check the prediction
sales$predicted[1]
##        1 
## 13.56881

Check the prediction, we find the value is extremely small (13.57 << real sale price). We used log transformation to response variable! We need to transform the predicted values back by taking the exponential (exp) of each prediction to return them to the original sale price scale.

sales$predicted_exp <- exp(sales$predicted)
sales$predicted_exp[1]
##        1 
## 781375.6

📚 TODO: Explain the Results of Logistic Regression

10 points

You may discuss in small group, but you must prepare the submission by yourself. Generative AI is NOT allowed for write-up questions. You ask conceptual questions but you need to cite properly. How to cite?

Your submission should be HTML or PDF. Handwritten PDFs are acceptable as long as they are scanned and easy to read. Part B will not require any coding as well.

In this section, we will read a recent research paper published by Runstad Department faculty Rebecca J. Walter, Arthur Acolin, Ruoniu (Vince) Wang, Gregg Colburn, along with people from other institutions: Exploring the association between household compositional change and mobility of subsidized householders in the United States. You are able to find the PDF version within the lab folder.

This study examines the relationship between household compositional change (such as a partner, child, or adult entering or leaving the household) and residential mobility of subsidized householders in the US. They use the Annual Longitudinal Files 2005–2018 from U.S. Department of Housing and Urban Development, which is restricted-use data required researchers to apply to access. Northwest Federal Statistical Research Data Center is the center within UW partnered with U.S. Census Bureau to provide such data access.

Please skim the paper first and use the following to guide your further reading, especially about the Data and methods and Results parts.

Settings

  1. [1 pt] What is the unit of analysis of this study? What is the sample size \(N\)? Is this study cross-sectional or longitudinal (lecture 2 data)?
  2. [0.5 pt] Why do you think U.S. Census Bureau requires us apply for the data usage instead of providing public access like American Community Survey (ACS) data?
  3. [0.5 pt] In a research paper, we tend to include two parts of explanatory variables: (1) the key explanatory variables (or variables of interest), which directly address the research question; and (2) the control variables, which account for other factors that might also influence the response variable. In this paper, what are the key explanatory variables and response variable?
  4. [1 pt] What is the purpose of including local market characteristics as part of \(X\)? Even though we do not have access to the data, which variables could be multicollinear with local market characteristics?
  5. [0.5 pt] Can we use normal linear regression for this study?
  6. [0.5 pt] The follow picture shows model specification in page 8 of this paper. Do you find any problem with this equation and explanation, especially for the explanatory variable part?
Page 8 - Model Specification

Model Configurations and Results

This study use logistic regression, which is similar to linear regression but for binary response variable (1/0). We change \(Y\) with \(\log(P/(1-P))\), where \(P\) is the probability of \(Y=1\). This post from IBM would be helpful to understand logistic regression. Think about this model:

\[ \log(\frac{P}{1-P}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]

  1. [1 pt] How should we interpret \(\beta_1\)?
  2. [0.5 pt] In the result tables of this paper, they use Odds Ratio rather than the coefficient of \(X\). How could you calculate the coefficient \(\beta\) for Household Member Enters or Exits (with odd ratio of 1.671)? The table below shows part of the results.
  3. [0.5 pt] How much more likely are households with a compositional change (a member entering or exiting) to move, compared to those without, controlling for other factors?
  4. [1 pt] What are the null and alternative hypotheses being tested by the t-value (398.7) for “Household Member Enters or Exits” in Model 1? What is the meaning of ‘***’?
  5. [1 pt] They conducted at least three logistic regression models. Why did they do that?
  6. [1 pt] The authors only use tables to present results. What is one way to visualize the adds ratio from this logistic model to show uncertainty of the results? You can use a plot example to illustrate.
  7. [1 pt] The title of this paper uses association rather than effect, impact, or influence. Authors are also very cautious term usages in main text. What is the reason?
Table 5 - Logistic Regression Results

Acknowledgement

The materials are developed by Haoyu Yue based materials from Dr. Feiyang Sun at UC San Diego, Siman Ning and Christian Phillips at University of Washington, Dr. Charles Lanfear at University of Cambridge.